The Cerebras Wafer Scale Engine (WSE)


The most powerful processor for AI 


46,225 mm² silicon

1.2 trillion transistors

400,000 AI-optimized cores

18 Gigabytes of on-chip memory
9 PByte/s memory bandwidth
100 Pbit/s fabric bandwidth
TSMC 16nm process
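
As a rough sanity check on these figures, the aggregate numbers can be divided across the 400,000 cores to get a per-core view. The even split across cores, the decimal units (1 GB = 10^9 bytes), and the square die are assumptions made for illustration, not published per-core specifications.

```python
import math

# Aggregate figures from the slide; decimal units are assumed.
CORES = 400_000
ON_CHIP_MEMORY_BYTES = 18e9        # 18 GB
MEMORY_BANDWIDTH_BYTES_S = 9e15    # 9 PByte/s
FABRIC_BANDWIDTH_BITS_S = 100e15   # 100 Pbit/s
DIE_AREA_MM2 = 46_225


def per_core(total, cores=CORES):
    """Evenly divide an aggregate resource across all cores (an assumption)."""
    return total / cores


memory_kb = per_core(ON_CHIP_MEMORY_BYTES) / 1e3           # kilobytes per core
mem_bw_gb_s = per_core(MEMORY_BANDWIDTH_BYTES_S) / 1e9     # GB/s per core
fabric_gbit_s = per_core(FABRIC_BANDWIDTH_BITS_S) / 1e9    # Gbit/s per core
die_side_mm = math.sqrt(DIE_AREA_MM2)                      # side length if the die is square

print(f"memory per core:           {memory_kb:.1f} kB")
print(f"memory bandwidth per core: {mem_bw_gb_s:.1f} GB/s")
print(f"fabric bandwidth per core: {fabric_gbit_s:.1f} Gbit/s")
print(f"die side (if square):      {die_side_mm:.1f} mm")
```

Under these assumptions each core has tens of kilobytes of local memory and tens of gigabytes per second of bandwidth to it, which is the basis for the small-batch execution described later in the deck.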


WSE - 2D mesh of 400,000 fully programmable processing elements 
Leverage 400,000 tightly-connected cores to accelerate deep learning


* Use a blend of distribution strategies: all types of model parallel + data parallel 
* Rely on model parallel first, as it doesn't depend on batch size 
* Add data parallel for small models


* Dynamically choose the execution strategy optimized for different models 


Figure labels: Data parallel; Distribution across different devices with traditional processors; Model parallel, within a layer; Model parallel, layer-pipelined.


Figure labels: Data parallel; Distribution across processing elements of WSE; Model parallel, within a layer.
Model parallel within a layer


Distribute execution of a single layer across multiple processing elements (PEs) 


* Compiler chooses an optimal number of PEs and optimal shape for every layer 
* Compute-heavy layers get larger PE allocations
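
One way to picture "compute-heavy layers get larger PE allocations" is a proportional split of the PE budget by each layer's estimated FLOPs. The sketch below is a hypothetical heuristic for illustration only: the layer names, FLOP counts, and the `allocate_pes` function are assumptions, and this is not the Cerebras compiler's algorithm.

```python
def allocate_pes(layer_flops, total_pes):
    """Split total_pes across layers in proportion to FLOPs.

    Uses largest-remainder rounding so the allocations sum exactly to total_pes.
    """
    total_flops = sum(layer_flops.values())
    exact = {name: total_pes * flops / total_flops
             for name, flops in layer_flops.items()}
    alloc = {name: int(share) for name, share in exact.items()}

    # Hand the PEs lost to rounding down to the layers with the largest remainders.
    leftover = total_pes - sum(alloc.values())
    by_remainder = sorted(exact, key=lambda n: exact[n] - alloc[n], reverse=True)
    for name in by_remainder[:leftover]:
        alloc[name] += 1
    return alloc


# Hypothetical per-layer FLOP estimates for a small network.
layers = {"conv1": 2.0e9, "conv2": 6.0e9, "fc": 0.5e9}
allocation = allocate_pes(layers, total_pes=400_000)

for name, pes in allocation.items():
    print(f"{name}: {pes} PEs")
```

A real mapping would also have to choose a rectangular shape for each allocation and account for communication, which this sketch ignores.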


Example: FC Layer (GEMV)

* Weights are stationary
* Each PE holds a tile of the weight matrix
* Forward and backward pass share the same set of PEs
* Input activation comes in from the vertical/horizontal direction
* Output activation goes out from the horizontal/vertical direction

Figure labels: Input activation (X), Weight (W), Output activation (Y).
* Each PE works on a subset of the input activation
* An input activation element is multiplied by a column of the weight matrix
* The results are accumulated into a set of accumulators (which are reset at the beginning)
* Each PE holds a partial sum for a subset of the result (output activation)
* Partial sums are accumulated to produce the final result
* Latency of partial sum accumulation is mitigated with input activation streaming (GEMM)
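
The GEMV walkthrough above can be sketched in plain Python. A hypothetical grid of simulated PEs each holds one stationary tile of W. Each PE multiplies its slice of the input activation by the columns of its tile, and the partial sums from the PEs that share output rows are added to form the result. The grid shape, the tiling orientation, and all function names are assumptions for illustration; the sketch models the arithmetic only, not the fabric communication.

```python
import random


def split_indices(n, parts):
    """Split range(n) into `parts` contiguous, nearly equal blocks."""
    base, extra = divmod(n, parts)
    blocks, start = [], 0
    for p in range(parts):
        size = base + (1 if p < extra else 0)
        blocks.append(range(start, start + size))
        start += size
    return blocks


def pe_partial_sum(w, rows, cols, x):
    """One PE: its stationary tile is W[rows, cols]; it sees only x[cols]."""
    acc = [0.0] * len(rows)               # accumulators are reset at the beginning
    for j in cols:                        # each input activation element...
        for k, i in enumerate(rows):      # ...is multiplied by a column of the tile
            acc[k] += x[j] * w[i][j]
    return acc


def tiled_gemv(w, x, grid_rows, grid_cols):
    """Compute y = W x on a grid_rows x grid_cols grid of simulated PEs."""
    y = [0.0] * len(w)
    for rows in split_indices(len(w), grid_rows):
        for cols in split_indices(len(x), grid_cols):
            partial = pe_partial_sum(w, rows, cols, x)
            for k, i in enumerate(rows):  # partial sums combine into the final result
                y[i] += partial[k]
    return y


def reference_gemv(w, x):
    """Untiled y = W x, used only to check the tiled version."""
    return [sum(w_ij * x_j for w_ij, x_j in zip(row, x)) for row in w]


random.seed(0)
w = [[random.uniform(-1, 1) for _ in range(10)] for _ in range(12)]
x = [random.uniform(-1, 1) for _ in range(10)]
y = tiled_gemv(w, x, grid_rows=3, grid_cols=4)
```

With a batch of inputs (GEMM) the same tiles are reused for every input vector, so the accumulation of one vector's partial sums can overlap with the multiplication of the next.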
Model parallel, layer-pipelined


Distribute execution of multiple layers across different fabric sections and keep entire model 
in fast on-chip memory 


* Compiler maps layers to the fabric to optimize compute and communication 


* Adjacent layers are typically placed next to each other


Challenging on a traditional cluster:

* Limited communication between devices
* Work must be divisible into fixed units of compute
* The ML researcher must choose the optimal distribution

Figure: a neural network split across a traditional cluster of four devices (Device 1 to Device 4).
Easy and efficient on CS-1:

* Low-latency, high-bandwidth communication between all cores
* Flexible units of compute
* Cerebras compiler automatically chooses optimal distribution

Figure: the same neural network on a traditional cluster (Device 1 to Device 4) and on the CS-1 (virtual device 1 to virtual device 4).
Data parallel

Replicate layers for higher performance on small models

* Use very small batch size (down to 1 sample) per replica
* Enabled by high-bandwidth, low-latency local memory
* Result: medium effective batch size

* Place replicas on adjacent fabric sections
* Low synchronization overheads due to high-bandwidth, low-latency connections between PEs
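
In this scheme the effective batch size is the number of replicas times the per-replica batch size. Averaging the replicas' gradients gives the same update as processing the whole batch in one place. A minimal sketch for a one-parameter linear model with squared-error loss follows. The model, the data, and the replica count are hypothetical, and the samples are assumed to divide evenly among replicas.

```python
def grad(w, batch):
    """Mean gradient of 0.5 * (w * x - y)**2 with respect to w over a batch."""
    return sum((w * x - y) * x for x, y in batch) / len(batch)


def data_parallel_grad(w, samples, per_replica_batch):
    """Each replica computes a gradient on its own small batch; results are averaged.

    Returns the averaged gradient and the number of replicas used.
    Assumes len(samples) is a multiple of per_replica_batch.
    """
    shards = [samples[i:i + per_replica_batch]
              for i in range(0, len(samples), per_replica_batch)]
    replica_grads = [grad(w, shard) for shard in shards]  # parallel on real hardware
    return sum(replica_grads) / len(replica_grads), len(shards)


# Hypothetical (x, y) training pairs.
samples = [(1.0, 2.0), (2.0, 3.5), (3.0, 6.5), (4.0, 8.0),
           (5.0, 9.5), (6.0, 12.5), (7.0, 14.0), (8.0, 15.5)]
w = 0.5
per_replica_batch = 1                     # down to 1 sample per replica
avg_grad, replicas = data_parallel_grad(w, samples, per_replica_batch)
effective_batch = replicas * per_replica_batch

print(f"{replicas} replicas x {per_replica_batch} sample = effective batch {effective_batch}")
```

The averaging step is the synchronization the slide refers to; its cost depends on how quickly replicas can exchange gradients.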


In summary: WSE uses a blend of parallel execution modes

* A single algorithm uses both model and data parallelism in optimization

* Execution strategies optimized for different neural networks

Few large layers: mostly model parallel within each layer
More layers: "more" layer-pipelined model parallel
Few small layers: data parallel
But scale in deep learning is not only about efficient distribution... 
it's also about compute flexibility. 
Future path: larger and smarter models

Brute-force scaling is the historical path to better models.

* This is challenging
* Memory needs to grow
* Compute needs to grow

Algorithmic innovations give more efficient models.

* These are promising but challenge existing hardware.

CS-1 delivers both. Extreme scale with fewer nodes. Flexible compute for smarter, efficient models.

Figure: chart comparing BERT Base, ALBERT xlarge, GPT-2, T5-3B, T5-11B, Megatron, and T-NLG, annotated "Memory grows linearly" and "Compute grows exponentially".
CS-1 is designed to unlock smarter techniques and scale


CS-1 has a data flow architecture


* Flexibility to stream token by token
* Inherent sparsity harvesting
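
In a dataflow design, computation is triggered by arriving data, so an activation that is never sent triggers no work. The sketch below illustrates the idea on the GEMV pattern by counting multiply-accumulates when zero activations (for example after ReLU) are skipped. It is a software model with hypothetical names and data, assuming zeros are filtered out before being streamed; it does not describe the hardware mechanism.

```python
def relu(values):
    """ReLU produces exact zeros, which is what makes activations sparse."""
    return [max(0.0, v) for v in values]


def sparse_gemv(w, x):
    """y = W x, skipping zero activations.

    Returns the result and the number of multiply-accumulates performed.
    """
    y = [0.0] * len(w)
    macs = 0
    for j, x_j in enumerate(x):
        if x_j == 0.0:            # a zero is never streamed, so it costs nothing
            continue
        for i in range(len(w)):   # nonzero element times column j of W
            y[i] += x_j * w[i][j]
            macs += 1
    return y, macs


w = [[1.0, -2.0, 0.5, 3.0],
     [0.0, 1.5, -1.0, 2.0],
     [2.0, 0.5, 1.0, -1.0]]
x = relu([0.7, -1.2, 0.0, 2.0])   # two of the four activations become zero
y, macs = sparse_gemv(w, x)
dense_macs = len(w) * len(x)

print(f"{macs} of {dense_macs} multiply-accumulates performed")
```

The work saved scales with the fraction of zero activations, and the result is identical to the dense computation.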


CS-1 is a MIMD architecture 


* Can program each core independently 
* Perform different operations on different data 


CS-1 was built to enable the next generation of models otherwise limited today. 
CS-1 + new techniques → efficient, extreme-scale models

* Shared weights (e.g. ALBERT): same accuracy in only ~20% of the size
* Dynamic depth, per batch, per sequence, or per token (e.g. Universal Transformer): FLOPs reduction of 11% per batch, 20% per sequence, 50% per token
* Activation sparsity (e.g. dropout): up to 50% FLOPs reduction at negligible accuracy loss
* Attention sparsity (e.g. Sparse Transformers): attention cost O(n²) → O(n√n)
* Irregularity (e.g. Evolved Transformer): bigger bang for parameter buck
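
To give a sense of scale for the attention cost reduction from O(n²) to O(n√n): the ratio between the two is √n, so the saving grows with sequence length. The snippet below ignores constant factors and uses illustrative sequence lengths that do not come from the slides.

```python
import math


def dense_attention_cost(n):
    """Asymptotic cost of full attention over n tokens (constants ignored)."""
    return n * n


def sparse_attention_cost(n):
    """Asymptotic cost of O(n * sqrt(n)) sparse attention (constants ignored)."""
    return n * math.sqrt(n)


def reduction_factor(n):
    """How many times fewer operations the sparse pattern needs; equals sqrt(n)."""
    return dense_attention_cost(n) / sparse_attention_cost(n)


for n in (1024, 4096, 16384):     # illustrative sequence lengths
    print(f"n={n}: about {reduction_factor(n):.0f}x fewer attention operations")
```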
Summary


* Cerebras WSE is a 2D mesh of 400,000 programmable processing elements 


* Cerebras Graph Compiler can automatically choose the optimal blend of parallel 
execution strategies for each given model 


* No communication or memory bottlenecks due to local memory and high-bandwidth,
low-latency fabric


* MIMD + data flow architecture provide unique flexibility to enable the next 
generation of models 


Performance of a cluster, ease of use of a single device, 
and unique flexibility 


Thank you 


natalia@cerebras.net 


